Context-Aware Zero-Shot Recognition
We present a novel problem setting in zero-shot learning: zero-shot object
recognition and detection in context. Contrary to traditional zero-shot
learning methods, which simply infer unseen categories by transferring
knowledge from objects belonging to semantically similar seen categories,
we aim to identify novel objects in an image surrounded
by known objects using an inter-object relation prior. Specifically, we
leverage the visual context and the geometric relationships between all pairs
of objects in a single image, and capture the information useful to infer
unseen categories. We integrate our context-aware zero-shot learning framework
into the traditional zero-shot learning techniques seamlessly using a
Conditional Random Field (CRF). The proposed algorithm is evaluated on both
zero-shot region classification and zero-shot detection tasks. The results on
the Visual Genome (VG) dataset show that our model significantly outperforms
traditional methods by exploiting the additional visual context.
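The idea of combining per-region zero-shot scores (unary terms) with pairwise inter-object relation priors can be sketched as CRF-style MAP inference. The following toy example is purely illustrative: the category names, scores, and brute-force inference are assumptions for a tiny label space, not the paper's actual model or data.

```python
# Toy sketch of context-aware zero-shot inference: unary scores from a
# traditional zero-shot classifier are combined with pairwise compatibility
# priors between category pairs, and the jointly best labeling is chosen.
from itertools import product

# Unary scores per region (illustrative numbers).
unary = [
    {"zebra": 0.4, "horse": 0.6},      # region 0: ambiguous novel object
    {"savanna": 0.9, "street": 0.1},   # region 1: confidently recognized context
]

# Pairwise prior: compatibility of co-occurring category pairs (illustrative).
pairwise = {
    ("zebra", "savanna"): 0.8, ("horse", "savanna"): 0.2,
    ("zebra", "street"): 0.1,  ("horse", "street"): 0.7,
}

def map_labeling(unary, pairwise):
    """Brute-force MAP inference over a fully connected CRF.

    Feasible only for a handful of regions; real systems would use
    approximate inference (e.g. mean-field or loopy belief propagation).
    """
    best, best_score = None, float("-inf")
    for labels in product(*[list(u) for u in unary]):
        score = sum(unary[i][l] for i, l in enumerate(labels))
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                score += pairwise.get((labels[i], labels[j]), 0.0)
                score += pairwise.get((labels[j], labels[i]), 0.0)
        if score > best_score:
            best, best_score = labels, score
    return best

print(map_labeling(unary, pairwise))  # -> ('zebra', 'savanna')
```

Note how the context prior flips region 0 to "zebra": its unary score alone prefers "horse" (0.6 vs. 0.4), but "zebra" is far more compatible with the confidently labeled "savanna" region.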
Comprehension-Guided Referring Expressions
We consider generation and comprehension of natural language referring
expressions for objects in an image. Unlike generic "image captioning", which
lacks a natural standard evaluation criterion, the quality of a referring
expression may be measured by the receiver's ability to correctly infer which
object is being described. Following this intuition, we propose two approaches
that utilize models trained for the comprehension task to generate better
expressions. First, we use a comprehension module trained on human-generated
expressions as a "critic" of the referring expression generator. The comprehension module serves as
a differentiable proxy of human evaluation, providing training signal to the
generation module. Second, we use the comprehension module in a
generate-and-rerank pipeline, which chooses from candidate expressions
generated by a model according to their performance on the comprehension task.
We show that both approaches lead to improved referring expression generation
on multiple benchmark datasets.
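The generate-and-rerank pipeline can be sketched as follows. The scoring function below is a toy stand-in for a trained comprehension model, and all object attributes and candidate expressions are invented for illustration; only the overall select-the-best-candidate structure mirrors the description above.

```python
# Toy generate-and-rerank sketch: a generator proposes candidate referring
# expressions, and a comprehension scorer reranks them by how reliably each
# one picks out the target object rather than a distractor.

def comprehension_score(expression, target, distractors):
    """Stand-in for a comprehension module: fraction of target attributes
    the expression mentions, minus the best overlap with any distractor."""
    words = set(expression.lower().split())
    hit = len(words & target) / len(target)
    confusion = max((len(words & d) / len(d) for d in distractors), default=0.0)
    return hit - confusion

def generate_and_rerank(candidates, target, distractors):
    """Pick the candidate the comprehension scorer resolves most reliably."""
    return max(candidates, key=lambda e: comprehension_score(e, target, distractors))

# Illustrative scene: a red mug (target) next to a blue mug (distractor).
target = {"red", "mug", "left"}
distractors = [{"blue", "mug", "right"}]
candidates = ["the mug", "the red mug on the left", "the blue mug"]

print(generate_and_rerank(candidates, target, distractors))
# -> 'the red mug on the left'
```

The underspecified candidate "the mug" scores poorly because it matches the distractor equally well, which is exactly the ambiguity the comprehension-based reranking is meant to penalize.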